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^H Summary 

P^ In recent years, there has been considerable interest in estimating conditional independence 

. graphs in the high-dimensional setting. Most prior work has assumed that the variables are multi- 

CIh variate Gaussian, or that the conditional means of the variables are linear. Unfortunately, if these 

O^ assumptions are violated, then the resulting conditional independence estimates can be inaccu- 

QQ rate. We present a semi-parametric method, SpaCE JAM, which allows the conditional means 

1-H of the features to take on an arbitrary additive form. We present an efficient algorithm for its 

^^ computation, and prove that our estimator is consistent. We also extend our method to estima- 

nj tion of directed graphs with known causal ordering. Using simulated data, we show that SpaCE 

^ JAM enjoys superior performance to existing methods when there are non-linear relationships 

'^ among the features, and is comparable to methods that assume multivariate normality when the 

j^ conditional means are linear. We illustrate our method on a cell-signaling data set. 
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■^ In recent years, there has been considerable interest in developing methods to estimate the 
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__J joint pattern of association among a set of random variables. The relationships between d random 
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1. Introduction 



variables can be summarized with an undirected graph F = {V,E)m which the random variables 
are represented by the vertices F = { 1 , . . . , d} and the conditional dependencies between pairs 
of variables are represented by edges E C V x V. That is, for each j G V, we want to determine 
a minimal set of variables on which the conditional densities pj {xj \ {xk,k ^ j}) depend, 

Pj{xj I {xk, k / j}) = pj{xj I {xk ■■ {k,j) G E}). 

Recently there has also been considerable work in estimating marginal associations between a 
set of random variables (see e.g. Basso et al., 2005; Meyer et al., 2008; Liang & Wang, 2008; 



A. VOORMAN, A. SHOJAIE, D. WITTEN 







-3-2-10123 



PKC 
(d) 



Fig. 1: Cell signaling data from Sachs et al. (2005). (a)-(c) Pairwise scatterplots for PKC, P38 
and PJNK. (d) Partial residuals from the linear regression of P38 on PKC and PJNK. The data 
are standardized to have normal marginal distributions, but are clearly not multivariate normal. 



Hausser & Strimmer, 2009; Chen et al., 2010); however, in this paper we focus on conditional 
dependencies, which provide richer information about the relationships among the variables. 

Estimating the conditional independence graph F based on a set of n observations is an old 
problem (Dempster, 1972). In the case of high-dimensional continuous data, most prior work 
has assumed either (a) multivariate Gaussianity (see e.g. Friedman et al., 2008; Rothman et al., 
2008; Yuan & Lin, 2007; Banerjee et al., 2008) or (b) linear conditional means (see e.g. Mein- 
shausen & Biihlmann, 2006; Peng et al., 2009) for the features. However, as we will see, these 
two assumptions are essentially equivalent. As an illustration, consider the cell signaling data 
set from Sachs et al. (2005), which consists of protein concentrations measured under a set of 
perturbations. We analyze the data set in more detail in Section 5-3. Pairwise scatterplots of three 
of the variables are given in Figure 1 (a)-(c) for one of 14 perturbations. Here, the data have been 
transformed to be marginally normal, as suggested by Liu et al. (2009). The transformed data 
clearly are not multivariate normal, given the non-constant variance in the bivariate scatterplots, 
and as confirmed by a Shapiro-Wilk test (p < 2 x 10^^^). 

Can the data in Figure 1 be well-represented by linear relationships? In Figure 1 (d), we see 
strong evidence that the conditional mean of the protein P38 given PKC and PJNK is nonlinear. 
This is corroborated by the fact that the p-value for including quadratic terms in the linear re- 
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gression of P38 onto PKC and PJNK is small (p < 2 x 10^^^). Therefore in this data set, the 

features are not multivariate Gaussian, and marginal transformations do not remedy the problem. 

In order to flexibly model conditional mean relationships, we could specify a more flexi- 
ble joint distribution. However, joint distributions are difficult to construct and computationally 
challenging to fit, and the resulting conditional models need not be easy to obtain or interpret. 
Alternatively we can specify the conditional distributions directly. This has the advantage of 
simpler interpretation and greater computational tractability. In this paper, we will model the 
conditional means of non-Gaussian random variables with generalized additive models (Hastie 
& Tibshirani, 1990), and will use these in order to construct conditional independence graphs. 

Throughout this paper, we will assume that we are given n independent and identically dis- 
tributed observations from a (i-dimensional random vector x = (xi, . . . , Xd) ^ V . Our observed 
data can be written as X = \x\^ . . .Xd\ G M"^"^. 

The rest of the paper is organized as follows. In Sections 2 and 3 we review methods for 
modeling conditional dependence relationships among a set of variables, and discuss their limi- 
tations. In Section 4 we propose our method (SpaCE JAM) and an algorithm for its computation. 
We illustrate our method on real and simulated data in Section 5, and compare with available 
methods. In Section 6 we extend the method to the estimation of directed acyclic graphs with 
known causal ordering. In Section 7 we prove consistency of our algorithm, and in Section 8 we 
propose a screening rule for estimation in high dimensions. The discussion is in Section 9. 

2. Modeling conditional dependence relationships 

Suppose we are interested in estimating the conditional independence graph F for a random 
vector x e W^. If the joint distribution is known up to some finite dimensional parameter Q, then 
to estimate F it suffices to estimate Q via e.g. maximum likelihood. One practical difficulty that 
arises in estimating F is specification of a plausible joint distribution. Specifying a conditional 
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distribution, such as in a regression model, is typically much less daunting. We therefore consider 

pseudo-likelihoods (Besag, 1974, 1975) of the form 

d 
log{ppL{x;e)) = ^log{pj{xj I {xk : {j,k) G E};e)) . 
i=i 

For a set of arbitrary conditional distributions, there need not be a compatible joint distribu- 
tion (Wang & Ip, 2008). However, the conditionally specified graphical model has an appealing 
theoretical justification, in that it minimizes the KuUback-Leibler distances to the conditional dis- 
tributions (Varin & Vidoni, 2005). Furthermore, in estimating conditional independence graphs, 
our scientific interest is in the conditional independence relationships rather than in the joint dis- 
tribution. So in a sense, modeling the conditional distribution rather than the joint distribution 
amounts to a more direct approach to graph estimation. We therefore advocate for an approach for 

non-Gaussian graphical modeling based on conditionally specified models (Varin et al., 2011). 

3. Previous work 
3-1. Estimating graphs with Gaussian data 

Suppose for now that x has a joint Gaussian distribution with mean and precision matrix 0. 
One can write the negative log-likelihood of the joint distribution, up to constants, as 

-logdet(e) + tr (xx^e) . (1) 

In this case, the conditional relationships are linear, 

Xj I {xk, A; / j} = ^ PjkXk + ej, j = l,...,d, (2) 

where /Sj^ = —Qjk/Qkk and ej ~ A'^i(0, l/@jj). To estimate the graph F, we must determine 
which (3jk are zero in (2), or equivalently which Bj^ are in (1). This is simple when n S> d. 

In the high-dimensional setting, when the maximum likelihood estimate is unstable or un- 
defined, a number of approaches have been proposed to estimate the conditional independence 
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graph r, which we review here. Meinshausen & Biihlmann (2006) proposed fitting (2) using an 



?i-penaUzed regression. This is referred to as neighborhood selection: 

d 1 

"" (3) 



{d d 1 



Here A is a nonnegative tuning parameter that encourages sparsity in the coefficient estimates. 
Peng et al. (2009) improved upon the neighborhood selection approach by applying i-i penalties 
to the partial correlations; this is known as sparse partial correlation estimation. 

As an alternative to (3), many authors have considered estimating B under the multivariate 
normality assumption by maximizing an £1 -penalized joint log likelihood (see e.g. Yuan & Lin, 
2007; Banerjee et al., 2008; Friedman et al., 2008). This amounts to the optimization problem 

e = argmin{-logdet(e) + tr {X^ XQ) /n + A||e||i} , (4) 

known as the graphical lasso. The solution B to (4) serves as an estimate for B, and hence the 
sparsity pattern of B (induced by the ^1 penalty) provides an estimate of V. 

At first glance, neighborhood selection and sparse partial correlation may seem semi- 
parametric: a linear model may hold in the absence of multivariate normality. However, while 
(2) can accurately model each conditional dependence relationship semi-parametrically, the ac- 
cumulation of these specifications is very restrictive in terms of the joint distribution. In fact, 
Khatri & Rao (1976) proved that if (2) holds, along with some other mild assumptions, then the 
joint distribution must be multivariate normal. Notably, this is true regardless of the distribution 
of the errors ei, . . . , e^ in (2). In other words, even though (3) does not explicitly involve the 
multivariate normal likelihood, normality is implicitly assumed. This means that if we wish to 
model non-normal continuous data, then non-linear conditional models are necessary. 
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3-2. Estimating graphs with non-Gaussian data 

We now briefly review three existing methods for modeUng conditional independence graphs 
with non-Gaussian data. The normal copula or nonparanormal model (Liu et al. 2009, Liu et al. 
2012, Xue & Zou 2012, studied in the Bayesian context by Dobra & Lenkoski 2011) assumes 
that X has a nonparanormal distribution: that is, (/ii(xi), . . . , hd{xd)) '^ Nd{0, G) for functions 
/ii(-), . . . , h(i{-). After /ii(-), . . . , hd{-) are estimated, one can apply any of the methods men- 
tioned in Section 3-1 to the transformed data. The conditional model implicit in this approach 
is 

hj{xj)\ {xk, k y^j} = '^l3jkhk{xk) + ej, j = l,...,d. (5) 

This is itself a restrictive assumption, which may not hold, as seen in Figure 1. 

Forest density estimation (Liu et al., 2011) replaces the need for distributional assumptions 
with graphical assumptions: the underlying graph is assumed to be a forest. Then bivariate den- 
sities are estimated non-parametrically. Unfortunately, the restriction to acyclic graphs may be 
inappropriate in applications, and maximizing over all possible forests is infeasible. 

The graphical random forests (Fellinghauer et al., 201 1) approach uses random forests to flex- 
ibly model conditional means, and allows for interaction terms. But this does not correspond to 
a well-defined statistical model, and guarantees on feature selection consistency are unavailable. 

4. Method 

4-1. Jointly additive models 

In order to estimate a conditional independence graph using a pseudolikelihood approach, we 

must estimate the variables on which the conditional distributions Pj{-) depend. However, since 

density estimation is generally a challenging task, especially in high dimensions, we focus on 
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the simpler problem of estimating the conditional mean E[xj | {xk '■ (j, k) G E} ], under the 

assumption that the conditional distribution and the conditional mean depend on the same set of 

variables. Thus, we seek to estimate the conditional mean /j(-) in the regression model 

Xj\{xk, k y^j} = fj {xk-.k^ j) + ej, 

where ej is a mean-zero error term. Since estimating arbitrary functions /, (•) is infeasible in high 
dimensions, we restrict ourselves to additive models of the form 

Xj\{xk, fc / i} = ^ fjk{xk) + ej, (6) 

where fjk{-) £ Tior some space of functions T. This amounts to modeling each variable using a 
generalized additive model (Hastie & Tibshirani, 1990). Unlike Fellinghauer et al. (2011), we do 
not assume that the errors ej are independent of the additive components fjk{-), but merely that 
the conditional independence structure can be recovered from the additive components fjk{-)- 

4-2. Estimation with SpaCE JAM 
Since we believe that the conditional independence graph is sparse, we fit (6) using a penalty 
that performs simultaneous estimation and selection of the fjk{-)- Specifically, we link together 
d sparse additive models (Ravikumar et al., 2009) using a penalty that groups the parameters 
corresponding to a single edge in the graph. This results in the problem 



f 1 '^ 

\^T. W^^ - E hk{xk)\\l + A j; {Ujk{Sk)\\l + WfkA^ 

\ i=l ki^j k>j 



mmimize^^ ^ ^ ^ "^^' ~ ^ fjkixkM + ^ 2^ {\\fjk(.Xk)\\i + \\fkj(.Xj)\\l)^ ^ } . (7) 



We consider fjk{xk) = '^ jkPjk^ where ^j^ is a n x r matrix whose columns are basis functions 
used to model the additive components fjk, and j3jk is an r-vector containing the associated co- 
efficients. For instance, if we use a linear basis function, i.e. ^^^ = Xk, then r = 1 and we are 
modeling only linear conditional means, as in Meinshausen & Biihlmann (2006). Higher-order 
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terms allow us to model more complex dependencies. The standardized group lasso penalty 

(Simon & Tibshirani, 2011) encourages sparsity and ensures that the estimates of fjk{-) and 
fkj{-) will be simultaneously zero or non-zero. Problem (7) is the natural extension of sparse 
additive modeling (Ravikumar et al., 2009) to graphs, and generalizes neighborhood selection 
(Meinshausen & Biihlmann, 2006) and sparse partial correlation (Peng et al., 2009) to allow for 
flexible conditional means. We call the solution to (7) SpaCE JAM (for SPArse Conditional Es- 
timation with Joint Additive Models), to reflect its ties with the aforementioned techniques. 



Algorithm 1. SpaCE JAM algorithm 

Initialize /3's 

Repeat until convergence: 
For ij, k) eV xV: 
1: Calculate the vector of residuals for the jth and fcth variables: 

rjk ^ Xj - zZi^j^k "^jil^ji 
Tkj ^ Xk- J2ij^j,k "^kSki 

2: Regress the residuals on the specified basis functions: 



3: Threshold: 



kk ^U-n\ {\\^jkkk\\l + W'^kjkjWt) ^ j Pjk 
hj ^ U-n\ (\\^jkkk\\l + W^kjhjWt) ) i^kj 



Algorithm 1 uses block coordinate descent to solve (7). Since (7) is convex, the algorithm con- 
verges to the global minimum (Simon & Tibshirani, 2011). Performing Step 2 requires an r x r 
matrix inversion, where r is the number of basis functions; this must be performed only twice 
per pair of variables. Estimating 30 conditional independence graphs with r = 3 on a simulated 
data set with n = 50 and d = 100 takes 1.1 seconds on a 2.8 GHz Intel Core 17 Macbook Pro. 
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4-3. Tuning 

A number of options for tuning parameter selection are available, such as generalized cross- 
validation (Tibshirani, 1996), the Bayesian information criterion (Zou et al., 2007), and stability 
selection (Meinshausen & Biihlmann, 2010). We take an approach motivated by the Bayesian 
information criterion, as in Peng et al. (2009). For the jth variable, the criterion is 

BiCj(A) = nlog(RSSj(A)) + log(n)DFj(A), (8) 

where RSSj(A) = \\xj — Ylk^j '^jkPjkW'i i^ ^^'^ residual sum of squares from minimizing (7) 
with tuning parameter A, and df(A)j is the degrees of freedom used in this regression. We seek 
the value of A that minimizes Ylij=i^^^jW- When a single basis function is used, we can 
approximate the degrees of freedom by the number of non-zero parameters in the regression 
(Zou et al., 2007; Peng et al., 2009). But when r > 1 basis functions are used, we use 

3(A)||2 



DP,(A)H^fV(-l)E^%fe' (9) 



where S- = {k : ||/3k || / 0}. Though (9) was derived under the assumption of an orthogonal 
design matrix, it is a good approximation for the non-orthogonal case (Yuan & Lin, 2006). 

In order to perform SpaCE JAM, we must select a set of basis functions. In the absence of 
domain knowledge, we use cubic polynomials, which can approximate a wide range of functions. 



5. Numerical experiments 
5-1. Simulation setup 
As discussed in Section 2, it can be difficult to specify flexible non-Gaussian distributions for 
continuous variables. However, construction of multivariate distributions via conditional distri- 
butions is straightforward when the variables can be represented with a directed acyclic graph. 
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The joint probability distribution of variables in a directed acyclic graph can be decomposed as 

p{xi, . . . , Xd) = Wj=iPj{xj\{xk '■ {k,j) G Ed}), where Ed denotes the directed edge set of 

the graph. This is a valid joint distribution regardless of the choice of conditional distributions 

Pj{xj\{xk : {k,j) G Ed}) (Pearl, 2000, Chapter 1.4). We chose structural equations of the form 

Xj\{xk : (kj) e Ed} = ^ fjk{xk)+(^j, (10) 

(fe,i)e-Bu 

with ej ~ A'^(0, 1). If the fj^ are chosen to be linear, then the data are multivariate normal, and 

if the fjk are non-linear, then the data will typically not correspond to a well-known multivariate 

distribution. We moralized the directed graph in order to obtain the conditional independence 

graph (Cowell et al., 2007, Chapter 3.2). Note that here we have used directed acyclic graphs 

simply as a tool to generate non-Gaussian data, and that the full conditional distributions of the 

random variables created using this approach are not necessarily additive. 

We first generated a directed acyclic graph with d = 100 nodes and 80 edges chosen at random 

from the ( g ) possible edges. We used two schemes to construct a distribution on this graph. 

In the first setting, we chose fjk{xk) = bjkiXk + bjk2xl + bjksxl, where the bjki, bjk2, and 

bjks are independent and normally distributed with mean zero and variance 1, 0.5, and 0.5, 

respectively. In the second case, we chose fjk{xk) = x^, resulting in multivariate normal data. 

In both cases we scaled the fjk{xk) to have unit variance. We generated n = 50 observations, 

and compared SpaCE JAM to sparse partial correlation (Peng et al., 2009, R package space), 

graphical lasso (Yuan & Lin, 2007, R package glasso), neighborhood selection (Meinshausen 

& Biihlmann, 2006, R package glasso), nonparanormal (Liu et al., 2012; Xue & Zou, 2012, 

R package glasso), forest density estimation (Liu et al., 2011, code provided by authors), the 

method of Basso et al. (2005, R package minet), and graphical random forests (Fellinghauer 

et al., 2011, code provided by authors). In performing neighborhood selection, we declared an 
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Fig. 2: Simulation study. The number of correctly estimated edges is displayed as a function of 
incorrectly estimated edges, for a range of tuning parameter values, in the non-linear (left) and 
Gaussian (right) set-ups, averaged over 100 simulated data sets. Dots indicate the average model 
size chosen using the BIC criterion. In the order of appearance in the legend, the competing 
methods are those of Liu et al. (2012); Basso et al. (2005); Liu et al. (2011); Fellinghauer et al. 
(2011); Yuan & Lin (2007); Meinshausen & Buhlmann (2006); Peng et al. (2009). 



edge between the jth and fcth variables if J3jk 7^ or J3kj 7^ 0. We performed SpaCE JAM using 

- -> —2 —s 1 



three sets of basis functions: ^jk = [x/cX^], ^jfc = [xfc,x^], and ^j^ — ^ ^- -^^ ^^ 



5-2. Simulation results 

Figure 2 summarizes the results of our simulations. For each method, the numbers of correctly 
and incorrectly estimated edges were averaged over 100 simulated data sets for a range of 100 
tuning parameter values. When the /jfc(-) are non-linear, SpaCE JAM with the basis ^j^ = 
[xk,x\,x\] dominates SpaCE JAM with the basis sets '^ jk = [xk,x\] or [xk,x\], which in 
turn tend to enjoy superior performance relative to all other methods (left panel of Figure 2). 
Furthermore, even though the basis sets ^jk = [xk,x\] and [ x^ , x| ] do not entirely capture the 
functional forms of the data-generating mechanism, they still outperform methods that assume 
linearity, as well as competitors intended to model non-linear relationships. 

When the conditional means are linear and the number of estimated edges is small, all methods 
perform roughly equally (right panel of Figure 2). As the number of estimated edges is increased, 
sparse partial correlation performs best, while the graphical lasso, the nonparanormal and the 
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forest-based methods perform worse. This agrees with the observations of Peng et al. (2009) 

that sparse partial correlation and neighborhood selection tend to outperform the graphical lasso. 
In this setting, since non-linear terms are not needed to model the conditional dependence re- 
lationships, sparse partial correlation outperforms SpaCE JAM with two basis functions, which 
performs better than SpaCE JAM with three basis functions. Nonetheless, the loss in accuracy 
due to the inclusion of non-linear basis functions is not dramatic, and SpaCE JAM still tends to 
outperform other methods for non-Gaussian data, as well as the graphical lasso. 

5-3. Application to cell signaling data 

We apply SpaCE JAM to a data set consisting of measurements for 1 1 proteins involved in 
cell signaling, under 14 different perturbations (Sachs et al., 2005). To begin, we consider data 
from one of the 14 perturbations (n = 911), and compare SpaCE JAM using cubic polynomials 
to neighborhood selection, the nonparanormal skeptic, and graphical random forests with sta- 
bility selection. Minimizing the BIC for SpaCE JAM yielded a graph with 16 total edges. We 
compared SpaCE JAM to competing methods, selecting tuning parameters such that each result- 
ing estimated graph contained 16 edges, as well as 10 and 20 edges for the sake of comparison. 
Figure 3 displays the estimated graphs, along with the directed graph presented in Sachs et al. 
(2005). 

The graphs estimated using different methods are qualitatively different. If we treat the di- 
rected graph from Sachs et al. (2005) as the ground truth, then SpaCE JAM with 16 edges cor- 
rectly identifies 12 of the edges, compared to 11, 9, and 8 using sparse partial correlation, the 
nonparanormal skeptic, and random forests, respectively. 

Next we examined the other 13 perturbations, and found that for graphs with 16 edges, SpaCE 
JAM chooses on average 0.93, 0.64 and 0.2 more correct edges than sparse partial correlation, 
nonparanormal skeptic, and graphical random forests, respectively (p = 0.001, 0.19 and 0.68 
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using the paired t-test). Since graphical random forests does not permit arbitrary specification of 

graph size, when graphs with 16 edges could not be obtained, we used the next largest graph. 

Sparse partial Non- Random 

# Estimated correlation paranormal forest SpaCEJAM 

edges 



Sachs et al (2005) 

PIP2 
PIP3 . 



20 



p44.4; 



pakts4' 




16 



10 




Fig. 3: Cell signaling data set; graph reported in Sachs et al. (2005) is shown on the left. On 
the right, graphs were estimated using data from one perturbation of the data set. From top to 
bottom, panels contain graphs with 20, 16 and 10 edges. From left to right, comparisons are to 
Peng et al. (2009); Liu et al. (2012); Fellinghauer et al. (2011). We cannot specify an arbitrary 
graph size using graphical random forests, so graph sizes for that approach do not match exactly. 



In Section 1, we showed that these data are not well-represented by linear models even after the 
nonparanormal transformation. The superior performance of SpaCE JAM in this section confirms 
this observation. The differences between the SpaCE JAM and graphical random forests results 
indicate that the approach taken for modeling non-linearity does affect the results obtained. 



6. Extension to directed graphs 

In certain applications, it can be of interest to estimate the causal relationships underlying a set 
of features, typically represented as a directed acyclic graph. Though directed acyclic graph esti- 
mation is in general NP-hard, it is computationally tractable when the causal ordering is known. 
In fact, in this case, a modification of neighborhood selection is equivalent to the graphical lasso 
(Shojaie & Michailidis, 2010b). We extend the penalized likelihood framework of Shojaie & 
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Fig. 4: Simulation example with directed acyclic graphs. The simulation is exactly as in Sec- 
tion 5- 1 and Figure 2. For each method, the number of correctly and incorrectly estimated edges 
are averaged over 100 simulated data sets, for a range of 100 tuning parameter values. The com- 
peting method is that of Shojaie & Michailidis (2010b). 



Michailidis (2010b) to non-linear additive models by solving 



minimize < 

/3,fc,2<i<p,fc-<i 



1 

2n' 



^"^jkPjkWl + A^ ||^jfc/3jfc||2 



k-<j k^j 

where A; -< j indicates that k precedes j in the causal ordering. When "^jk = Xk, the model is 

exactly the penalized Gaussian likelihood approach of Shojaie & Michailidis (2010b). 

Figure 4 displays the same simulation scenario as Section 5-1, but with the directed graph 

estimated using the (known) causal ordering. Results are compared to the penalized Gaussian 

likelihood approach of Shojaie & Michailidis (2010b). SpaCE JAM performs best when the true 

relationships are non-linear, and performs competitively when the relationships are linear. 

7. Theoretical Results 

In this section, we provide theory for consistency of the SpaCE JAM graph estimate. Here, 
we focus on theory for undirected graphs. Similar results also hold for directed graphs, but we 
omit them due to space considerations. The theoretical development follows that of sparsistency 
results for sparse additive models with orthogonal series smoothers (Ravikumar et al., 2009). 

First, we must define the graph for which SpaCE JAM is consistent. Recall that we have the 
random vector x = (xi, . . . , Xd) ~ V, and X = [xi, . . . ,Xd] G M"^'^ is a matrix where each 
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row is an independent draw from V. For each (j, k) ^V xV consider the orthogonal set of 

basis functions ipjkt{-): i G N. Define the population level parameters /3*^ G M°^ as 



{P*jk^k = l,...,d] = argmin l^Xj -^^^jkt{xk)l^jkt? \ , j = ^,---,d. 
Let Sj = {k : ||/3*J| / 0} and Sj = \Sj\. Let fjk{xk) = EZi i^M^k)(3*kt e ^- Then 

xj = J2 fjk{xk) + ej, j = l,...,d, 

feeSj 

where ei, . . . , e^ are residuals, and J2keS- fjk{xk) is the best additive approximation to ¥,[xj \ 
{xk : k / j}], in the least-squares sense. We wish to determine which of the fjk{-) are zero. 

On observed data, we use a finite set of basis functions to model the fjk{-)- Denote the set of r 
orthogonal basis functions used in the regression of Xj on Xk as '^jk = [ipjki {xk), ■ ■ ■ , ipjkr {xk)], 
a matrix of dimension n x r such that ^J/^^jk/n = 1^. Let P*^ = [/3*^^, . . . , /?|fcj.]"^ denote the 
first r components of /3*^. Further, let '^ Sj be the concatenated basis functions in {^jk '■ k G 
Sj}, thus ^Sj is a matrix of dimension n x Sjr. Also let 'Ssj,Sj = ( k^^^Sj ) and T,jk^Sj = 



^^J^^5 ) . Define the sub-gradient of the penalty in (7) with respect to Pjk as gjk{P)- On the 
set Sj, we write the concatenated sub-gradients as (75 , a vector of length Sjr. 

Let $ be the parameter estimates from solving (7), let En = {{j, k) : \\(3jk\\2 + IIAjlli / 0} 
be the corresponding estimated edge set, and let E* = {{j, k) : k ^ Sj or j ^ Sk} be the graph 
obtained from the population level parameters. In Theorem 1, we give precise conditions under 
which pr(£^„ = E'*) — ^ 1 as n — ^ 00. 



Theorem 1. Let the functions fjk be sufficiently smooth, in the sense that if f\J 



cir) 



m=i'4^jkt{xk)fi*jkt' ^^^« \fjk{^k) - fjk{xk)\ = Op(l/r™) uniformly in {j,k) e V x V for 
some 771 G N. For j = 1, . . . ,d, assume the basis functions satisfy A„iin{T,s-^s) ^ Cmin > 
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with probability tending to 1. Assume the irrepresentability condition, 



\^jk,s,^slsj9Sj\\l + \\T.kj,Sk^slsJSk\\l < 1 - 5, 



(11) 



holds for (j, k) ^ E* and some 6 > with probability tending to 1, where gs = Qs iP)- Assume 
the following conditions on the number of edges \E*\, the neighborhood size Sj, the regulariza- 
tion parameter A, and the truncation dimension r: 



rlog(r\E*''\) ^ rsj\og(r\E*\ 

^ -^ 0, max -^ — -5 

A^n j A^n 



7 
0, max — ^- — > 0, and 

j r'^A 



— max 
p* j 



Sjrlog(r|£'*| 
n 



1/2 



+ 



■^i I \/^„a1/2 



+ A(rsj 



where p* = min,- min^g^ 



. Further, assume the variables 



J wyjkWoo 
ijkt = ^jkt{xk)^j for j, /c G y, and J = 1, . . . , d 

have exponential tails, that is pr[ \ijkt\ > A — o,&~^^ for some a,b > 0. 

Then, the SpaCE JAM graph estimate is consistent: pr{En = E*) ^ 1 as n ^ 00. 



8. Extension of SpaCE JAM to high dimensions 

In this section, we propose an approximation to SpaCE JAM that can speed up computations 
in high dimensions. Our proposal is motivated by recent work in the Gaussian setting by Wit- 
ten et al. (2011) and Mazumder & Hastie (2012). They showed that for the graphical lasso (4), 
the connected components of the estimated conditional independence graph are precisely the 
connected components of the estimated marginal independence graph, where the jth and fcth 
variables are considered marginally independent when | xjxk \ < A. Consequently, one can ob- 
tain the exact solution to the graphical lasso problem in substantially reduced computational time 
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by identifying the connected components of the marginal independence graph, and solving the 

graphical lasso optimization problem on the variables within each connected component. 

We now apply the same principle to SpaCE JAM in order to quickly approximate the solution 
to (7) in high dimensions. Let pm = supj „gjr p{f{xk),g{xj)) be the maximal correlation be- 
tween Xj and Xk over the univariate functions in T such that /{x^) and g{xj) have finite variance. 
Define the marginal dependence graph Tm = (V, Em), where (j, k) G Em when pH ' ^ 0. If 
the jth and fcth variables are in different connected components of Tm, then they must be condi- 
tionally independent in the large-sample SpaCE JAM graph. Theorem 2, proven in the Appendix, 
makes this assertion precise. 

Theorem 2. Let Ci, . . .Ci be the connected components o/Tm- Suppose the space of func- 
tions T contains linear functions. If j G C„ and k ^ Cufor some I < u < I, then (j, k) ^ E*. 

Theorem 2 forms the basis for Algorithm 2. There, we approximate the maximal correlation 
using the canonical correlation (Mardia et al, 1980) between the basis expansions "^kj and "^jk- 

Pm = max^,,^gMr pi^jkv, ^jkw). 

Algorithm 2. A fast approximation for SpaCE JAM in high dimensions 

1: For (j, k) ^V xV , calculate pm , the sample canonical correlation between '^kj and ^jk- 
2: Construct the marginal independence graph: (j, k) G Tm when \pm \ > A2. 
3: Find the connected components Ci, . . . Q of Tm- 
4: Perform Algorithm 1 on each connected component. 

In order to show that i) Algorithm 2 provides an accurate approximation to the original SpaCE 
JAM problem, ii) the resulting estimator outperforms methods that rely on Gaussian assumptions 
when those assumptions are violated, and iii) Algorithm 2 is indeed faster than Algorithm 1 , we 
replicated the graph used in Section 5-1 five times. This gives d = 500 variables, broken into five 
components. We took n = 250, and set "^ jk = [^k-,x\^x\]. 
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Fig. 5: Performance of SpaCE JAM using Algorithm 2. The number of correctly and incorrectly 
estimated edges are averaged over 100 simulated data sets, for each of 100 tuning parameter 
values. SpaCE JAM was applied using cubic polynomials as basis functions. The competing 
method is that of Meinshausen & Buhlmann (2006). 

In Figure 5 we see that when A2 in Algorithm 2 is small, there is little loss in statistical 
efficiency relative to the full SpaCE JAM algorithm (Algorithm 1), which is a special case of 
Algorithm 2 with A2 = 0. Further, we see that SpaCE JAM outperforms neighborhood selection 
even when A2 is large. Using Algorithm 2 with A2 = 0.5 and A2 = 0.63 led to a reduction in 
computation time over Algorithm 1 by 25% and 70%, respectively. 

(jk) 

We note here that Theorem 2 continues to hold if maximal correlation pm is replaced with 

(jk) (jk) 

some other measure of marginal association p; , provided that pi dominates maximal corre- 

(jk) (jk) 

lation in the sense that pi =0 implies that pm = 0. That is, any measure of marginal asso- 
ciation, such as mutual information, which detects the same associations as maximal correlation 

(jk) (jk) 

(i.e. pi / if /9m / 0) can be used in Algorithm 2. 



9. Discussion 

In this paper we have discussed conditional independence graph estimation for non-normal 
data. In the high-dimensional setting, assumptions on the joint distribution of a set of vari- 
ables cannot reasonably be expected to hold, and cannot be checked. Therefore, we have pro- 
posed SpaCE JAM, which models conditional distributions using flexible additive models, and 
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thereby gives more accurate graph estimation for non-normal data. The R package space jam 

at cran . r-pro ject . org/package=space jam implements the proposed approach. 

A possible extension to this work involves accommodating temporal information. We could 

take advantage of the natural ordering induced by time, as considered by Shojaie & Michailidis 

(2010a), and apply SpaCE JAM for directed graphs. We leave this to future work. 
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Appendix 1: Technical proofs 
A • 1 . Proof of Theorem 1 
First, we restate a theorem which will be useful in the proof of the main result. 

Theorem A1. (Kuelbs & Vidyashankar 2010) Let {S,n,j,i '■ i = i, . ■ ■ ,n,j £ An} beasetof 
random variables such that ^n,j,i is independent of^n,j,i' for i 7^ i'- That is, S,n,j,i, i = 1, ■ ■ ■ ,n 
denotes independent observations of feature j, and the features are indexed by some finite An. 
Assume K[S,n,j,i] = 0, and there exist constants a > I and 6 > such that pr(|^nj,i| > 2;) < 
QQ-bx yjgj. ^^ X > 0. Further, assume that \An\ < 00 for all n and that \An\ -^ 00 as n ^ 00. 
Denote Zn,j = Y17=i ^n,j,i- Then 

maxjg^„ \zn,j\ ^ ^ // log(IAt l)^ ^^^ 
n ^ I \ n 

We now prove Theorem I . 

Proof. First, /3 is a solution to (7) if and only if 

- - ^5 \Sj-Y^ %z/3j7 I + Xgjk 0) = for ij, k)eVxV, (Al) 

"" V '^^- / 

where gjk{$) is the vector satisfying 

\\9jkm\l + \\gkjm\l<l when ||/3,-fe||2 + ||/3fc,-||2 = 0. 

We base our proof on the primal-dual witness method of Wainwright (2009). That is, we construct 
a coefficient-subgradient pair (/3, g), and show that they solve (7) and produce the correct sparsity 
pattern, with probability tending to 1. For (j, k) ^ E*, we construct /3jfc and the corresponding 
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sub-gradients gjk using SpaCE JAM, restricted to edges in E* : 



arg mm 



( ^ d 

[_ 3=1 k&Sj {j,k)eE* 



\'^jk(3jk\\2 + ll^fci/^jfclb) 



2n1/2 



(A2) 

For {j, k) G E*^, we set (3^^ = 0, and use (Al) to solve for the remaining Qj^ when k ^ Sj. Now, 
/3 is a solution to (7) if 

hikCMWl + \\9kj{Pkj)h < 1 for (i, k) i E\ (A3) 

In addition. En = E* when 

/3s, /Ofori = l,...,d. (A4) 

Thus, it suffices to show that that Equations (A3) and (A4) hold with high probability. 



Condition (A4): We start with the 'primal' problem. The stationary condition for Ps- is given by 

Denote by ^^^5 fjk{xj) — fjki^j) = Wj the truncation error from including only r basis 
terms. We can write Xj = ^s(3*s + Wj + ej. And so 



n 



^s, {^sAf^s,-P7:')-^j-^j)+^9s,=o, 






or 



0s, - Pt^) 



n 



^%^S, 



-1 



n ^ n ^ V 



using the assumption that -^^^5 is invertible. We will now show that the inequality 



5. 



max WPsj - P's' ' lloo < mm mm n^^-j. noo 



j keSj 



o*(r)| 



/2^p*/2 



(A5) 



(A6) 



3*(r) I 



holds with high probability. This implies that ||/3jfc||2 / if ||/3.^ II2 / 0. 
From (A5) we have that 



max II^Q. - f3*J-'^' Woo < max 






+ max 

j 






+ max A 

j 



^s!.sJs, 



= Ti+T2 + T3. 
Thus, to show (A6) it suffices to bound Ti, T2, and T3. 



• Bounding Ti : 

By assumption, we have that \fjl.{xk) — fjk{xk)\ = Op{l/r"^) uniformly in k. Thus, 



n-i/2| 



Wjh 



1/" Y.keSj fjk i^k) - fjk{xk) = Opi-Sj/r"") uniformly in j. 
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This implies that 



Ti < max 

i 



-| J. ryi 

TlQ Q —'^ q Wj 



< max 

2 ^ 



^~' 



Sj ,Sj 



\Jn 



^\ 



\Jn 



Filb 



1/2 ^ , , ^, ^ /max,s, 






j = (Amm (S^^-^sJ)' 



In the above, we used that Am 
• Bounding T2. 

Here, we use Theorem Al which bounds the ^oo norm of the average of high-dimensional i.i.d. 
vectors. First, by the definition of e^ we must have that E[ V'j7ci(xfc)ej ] = 0, i.e. the residuals 
are uncorrelated with the covariates. 

Let Zj\^t = 'ipjkt{^k)'^^jy which is sum of n independent random variables with exponential 
tails. We have that 

max ||^_5,ej||oo/'T' = maxmax max \zjkt\/n < max max {|-Zjfct| V |zfcjt|/n} , 



j keSj t=l,...,r 



(j,k)€E'- t=l, 



the maximum of 2r|ii^*| elements. We can thus apply Theorem Al, with An indexing the 
2r|E'*| elements above, to obtain 



T2 = max 
j 



S 



1 



Si, Si „ Sj J 



^ie. 



< max(rs,f /2C-L0, 



< max 

j 



S 



-1 

Sj ,Sj 






loe(2r|£;*| 



n 



1/2N 



o« 



/maxj Sjrlog{r\E*\ 
V n 



1/2N 



Bounding T3: 

We have that ||5jA:||2 < 1 for {j, k) £ E*, so 



T-j < A max 



S 



-1 



< Amax(rSj) ' 

00 j 



^-1 



< A max 

2 j C, 



(rsj)i/2 



Altogether, we have shown that 



max \\l3sj - 13 g 



*(r) I 



< o„ ( — ^^^ ) + o. 



J^maxj s j ) r log (r | £■* | 



n 



1/2^ 



(''Sj) 



1/2 



+ A max 



By assumption. 



— max 
p* j 



Sjr log(r|£'*| 



n 



1/2 



^■?' I Xi'^.Al/S 



+ -:^ + Hrsj 



which implies that maxj \\l3s- — (3*g ||oo < P*/2 with probability tending to 1 as n — ;• 00. 
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Condition (A3): We now consider the 'dual' problem. That is, we must show that H^jfclb + 
WOkj lb < 1 for each {j, k) ^ E* . From the discussion of Condition (A4), we know that 



gjk 



^],i^>sAPs,-^') 



Wi 



An'^"'^^"^^""^- '^^■' ' ' 



^3 



\n 



'^lk('^s,^s],sA-'-s,^3 



*(r) 
•j 



n ^"^ 



A55, 



Wi 



'-*tJ/ ' 



Xn ^^ 



^"^s.^sU^Dw, 



Xn ^^\ n ^' ^^'^^ ^^ 



= Ml'' + Ml'' + Mi^ . 

We will proceed by bounding \\m{''\\2 + ||Mf^'||2, HM^'^'lb + \\M^^\\2 and ||M|^||2 + \\M^^\\2, 
which will give us a bound for the quantity of interest, H^jfc II2 + Wokj lb- 



• Bounding Mi. 

When bounding Ti earlier, we saw that ^^"'^'^lltt'jlb = Op{sj/r"^). Now 
(/ — ^5 S^^^ \1'^ /n) is a projection matrix, and by design n^^/^^j^ is orthogonal, 
so that all the eigenvalues of n^^'^^jk are 1. Therefore 



||Mf II2 < ^n-i/2||^ 



jk\\2n 



'1/2 



ti; 



'JII2 



O, 



Xr^ 



and 



|Mf||2 + ||Mf^'||2<0„f^^^ 



Xr^' 



which tends to zero because 
• Bounding M2: 
First, note that 



Ar" 



uniformly in j. 



n-'^''^s,^s;,s 



-1 



A||Mf II2 < n-i||^Jfce,||2 + n-V2||^^.^||^ 

< y/r\\'^Ji^ej\\oo/n + (rsj/Cmin)^^'^ \\'^'sXj\\oo/n. 
Then, applying Theorem A 1, as in the bound for T2, we get 



2 "^J -^ 



A max WM^'^h < Op 

{j,k)(^E*-" 



rlog{r\E*^ 



n 



1/2N 



+ 0, 



r maxj Sj log(r|£'*P^ ' 



Thus, maxQ-^;j)g^*c <! HM^ II2 + IIM2 II2 f — ^ when 



rlog{r\E*'' 
X^n 



rSjlog(r|^*| 
and max - 



X^n 



• Bounding M3: 
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By the irrepresentability assumption, we have that HMg H^ + H^fs"'!!! < 1 — <5 with proba- 
bihty tending to 1 . 

Thus, since ||Mj''''||2 + ||Mf^'||2 + ||M^'''||2 + ||M2-''||2 = Op(l), we have that for each (j, /c) G 

max {ll^jfclb + llfffcjib} < 1 -<J 
(i,fc)e£;*'= 

with probability tending to 1. Further, since we have strict dual feasibility, i.e. H^jfclb + ll^fcjlb < 
1 for {j, k) G E*^, with probability tending to 1, the estimated graph is unique. D 

A • 2. Proof of Theorem 2 
Proof. Consider a variable j, with j ^ Cu- Our large-sample model requires minimizing 

K\xj — J2ky^j Ylt^i '4'jkt{xk)Pjkt\'^ with respect to the j3jkt, or equivalently, minimizing 

IE|xj -^fjk{xk)?' 
over functions fjk G J^- We have that 

nxj-Y,f,k{xk)? = ^x] -2Y,nx3fjk{xk)]+Y,Y.^^fik^''^)f^'^''i^^ 

k^j k^j k^j Ij^j 

= Ex] - 2 Y, nxjfjkixk)]- 2 Y, nxjfjk{xk)]+ y Y1 ^[fMxk)fjiixi)] 

keC'u k^Cu k&Cu l&Cu 

+ Y. 2^IE[/,fc(xfc)/,KxO] + 2 Y J]lE[/ife(a;fc)/,/(x/)]. 

k(f,Cu l^Cu k^Cu l&Cu 

By assumption T^kic^^l^jfikixk)] = Y^k^c^ J2i(^Cu^[fjk{xk)fji{xi)] = 0. Thus, collect- 
ing terms, we get 

nx,-Yfjkixk)\^ = nxj- Y fjkixk)\^+E\ Y fjkixk)\^- 

kf^j keCu k^Cu 

Minimization of this quantity with respect to {fjk & T,k ^ Cu} only involves the last term, 
which achieves its minimum at zero when fjk{-) = almost everywhere for each k ^ Cu- 
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